36 research outputs found

    The Case for a Factored Operating System (fos)

    The next decade will afford us computer chips with 1,000 to 10,000 cores on a single piece of silicon. Contemporary operating systems have been designed to operate on a single core or small number of cores and hence are not well suited to manage and provide operating system services at such large scale. Managing 10,000 cores is so fundamentally different from managing two cores that the traditional evolutionary approach of operating system optimization will cease to work. The fundamental design of operating systems and operating system data structures must be rethought. This work begins by documenting the scalability problems of contemporary operating systems. These studies are used to motivate the design of a factored operating system (fos). fos is a new operating system targeting 1000+ core multicore systems where space sharing replaces traditional time sharing to increase scalability. fos is built as a collection of Internet-inspired services. Each operating system service is factored into a fleet of communicating servers which in aggregate implement a system service. These servers are designed much in the way that distributed Internet services are designed, but instead of providing high-level Internet services, these servers provide traditional kernel services and manage traditional kernel data structures in a factored, spatially distributed manner. The servers are bound to distinct processing cores and by doing so do not fight with end-user applications for implicit resources such as TLBs and caches. Also, spatial distribution of these OS services facilitates locality, as many operations only need to communicate with the nearest server for a given service.

    Vote the OS off your Core

    Recent trends in OS research have shown evidence that there are performance benefits to running OS services on different cores than the user applications that rely on them. We quantitatively evaluate this claim in terms of one of the most significant architectural constraints: memory performance. To this end, we have created CachEMU, an open-source memory trace generator and cache simulator built as an extension to QEMU for working with system traces. Using CachEMU, we determined that for five common Linux test workloads, it was best to run the OS close, but not too close: on the same package, but not on the same core.

    Remote Store Programming: Mechanisms and Performance

    This paper presents remote store programming (RSP). This paradigm combines usability and efficiency through the exploitation of a simple hardware mechanism, the remote store, which can easily be added to existing multicores. Remote store programs are marked by fine-grained and one-sided communication which results in a stream of data flowing from the registers of a sending process to the cache of a destination process. The RSP model and its hardware implementation trade a relatively high store latency for a low load latency because loads are more common than stores, and it is easier to tolerate store latency than load latency. This paper demonstrates the performance advantages of remote store programming by comparing it to both cache-coherent shared memory and direct memory access (DMA) based approaches using the TILEPro64 processor. The paper studies two applications: a two-dimensional Fast Fourier Transform (2D FFT) and an H.264 encoder for high-definition video. For a 2D FFT using 56 cores, RSP is 1.64x faster than DMA and 4.4x faster than shared memory. For an H.264 encoder using 40 cores, RSP achieves the same performance as DMA and 4.8x the performance of shared memory. Along with these performance advantages, RSP requires the least hardware support of the three. RSP's features, performance, and hardware simplicity make it well suited to the embedded processing domain.

    Core Count vs Cache Size for Manycore Architectures in the Cloud

    The number of cores which fit on a single chip is growing at an exponential rate while off-chip main memory bandwidth is growing at a linear rate at best. This core count to off-chip bandwidth disparity causes per-core memory bandwidth to decrease as process technology advances. Continuing per-core off-chip bandwidth reduction will cause multicore and manycore chip architects to rethink the optimal grain size of a core and the on-chip cache configuration in order to save main memory bandwidth. This work introduces an analytic model to study the tradeoffs of utilizing increased chip area for larger caches versus more cores. We focus this study on constructing manycore architectures well suited for the emerging application space of cloud computing where many independent applications are consolidated onto a single chip. This cloud computing application mix favors small, power-efficient cores. The model is exhaustively evaluated across a large range of cache and core-count configurations utilizing SPEC Int 2000 miss rates and CACTI timing and area models to determine the optimal cache configurations and the number of cores across four process nodes. The model maximizes aggregate computational throughput and is applied to SRAM and logic process DRAM caches. As an example, our study demonstrates that the optimal manycore configuration in the 32nm node for a 200 mm^2 die uses on the order of 158 cores, with each core containing a 64KB L1I cache, a 16KB L1D cache, and a 1MB L2 embedded-DRAM cache. This study finds that the optimal cache size will continue to grow as process technology advances, but the tradeoff between more cores and larger caches is complex in the face of limited off-chip bandwidth and the non-linearities of cache miss rates and memory controller queuing delay.

    Using LLMs to Facilitate Formal Verification of RTL

    Formal property verification (FPV) has existed for decades and has been shown to be effective at finding intricate RTL bugs. However, formal properties, such as those written as SystemVerilog Assertions (SVA), are time-consuming and error-prone to write, even for experienced users. Prior work has attempted to lighten this burden by raising the abstraction level so that SVA is generated from high-level specifications. However, this does not eliminate the manual effort of reasoning and writing about the detailed hardware behavior. Motivated by the increased need for FPV in the era of heterogeneous hardware and the advances in large language models (LLMs), we set out to explore whether LLMs can capture RTL behavior and generate correct SVA properties. First, we design an FPV-based evaluation framework that measures the correctness and completeness of SVA. Then, we evaluate GPT4 iteratively to craft the set of syntax and semantic rules needed to prompt it toward creating better SVA. We extend the open-source AutoSVA framework by integrating our improved GPT4-based flow to generate safety properties, in addition to facilitating their existing flow for liveness properties. Lastly, our use cases evaluate (1) the FPV coverage of GPT4-generated SVA on complex open-source RTL and (2) using generated SVA to prompt GPT4 to create RTL from scratch. Through these experiments, we find that GPT4 can generate correct SVA even for flawed RTL, without mirroring design errors. Particularly, it generated SVA that exposed a bug in the RISC-V CVA6 core that eluded the prior work's evaluation. Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

    Distributed data structure for factored operating systems

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 151-158). Future computer architectures will likely exhibit increased parallelism through the addition of more processor cores. Architectural trends such as exponentially increasing parallelism and the possible lack of scalable shared memory motivate the reevaluation of operating system design. This thesis work takes place in the context of Factored Operating Systems which leverage distributed system ideas to increase the scalability of multicore processor operating systems. fos, a Factored Operating System, explores a new design point for operating systems where traditional low-level operating system services are fine-grain parallelized while internally only using explicit message passing for communication. fos factors an operating system first by system service and then further parallelizes inside of the system service by splitting the service into a fleet of server processes which communicate via messaging. Constructing parallel low-level operating system services which only internally use messaging is challenging because shared resources must be partitioned across servers and the services must provide scalable performance when met with uneven demand. To ease the construction of parallel fos system services, this thesis develops the dPool distributed data structure. The dPool data structure provides concurrent access to an unordered collection of elements by server processes within a fos fleet. Internal to a single dPool instance, all communication between different portions of a dPool is done via messaging. This thesis uses the dPool data structure within the parallel fos Physical Memory Allocation fleet and demonstrates that it is possible to use a dPool to manage shared state in a factored operating system's physical page allocator.
    This thesis begins by presenting the design of the prototype fos operating system. In the context of fos system service fleets, this thesis describes the dPool data structure, its design, different implementations, and interfaces. The dPool data structure is shown to achieve scalability across even and uneven micro-benchmark workloads. This thesis shows that common parallel and distributed programming techniques apply to the creation of dPool and that background threads within a dPool can increase performance. Finally, this thesis evaluates different dPool implementations and demonstrates that intelligently pushing elements between dPool parts can increase scalability. by David Wentzlaff. Ph.D.

    A Unified Operating System for Clouds and Manycore: fos

    Single-chip processors with thousands of cores will be available in the next ten years, and clouds of multicore processors afford the operating system designer thousands of cores today. Constructing operating systems for manycore and cloud systems faces similar challenges. This work identifies these shared challenges and introduces our solution: a factored operating system (fos) designed to meet the scalability, faultiness, variability of demand, and programming challenges of OS's for single-chip thousand-core manycore systems as well as current-day cloud computers. Current monolithic operating systems are not well suited for manycores and clouds, as they have taken an evolutionary approach to scaling, such as adding fine-grain locks and redesigning subsystems; however, these approaches do not increase scalability quickly enough. fos addresses the OS scalability challenge by using a message-passing design and is composed of a collection of Internet-inspired servers. Each operating system service is factored into a set of communicating servers which in aggregate implement a system service. These servers are designed much in the way that distributed Internet services are designed, but provide traditional kernel services instead of Internet services. Also, fos embraces the elasticity of cloud and manycore platforms by adapting resource utilization to match demand. fos facilitates writing applications across the cloud by providing a single system image across both future 1000+ core manycores and current-day Infrastructure as a Service cloud computers. In contrast, current cloud environments do not provide a single system image and introduce complexity for the user by requiring different programming models for intra- vs inter-machine communication, and by requiring the use of non-OS standard management tools.